Abstract



Many fundamental multi-processor coordination problems can be expressed 
as counting problems: processes must cooperate to assign successive values 
from a given range, such as addresses in memory or destinations on an in- 
terconnection network. Conventional solutions to these problems perform 
poorly because of synchronization bottlenecks and high memory contention. 

Motivated by observations on the behavior of sorting networks, we offer 
a new approach to solving such problems, by introducing counting networks, 
a new class of networks that can be used to count. 

We give two counting network constructions, one of depth $\log n (1 + \log n)/2$
using $n \log n (1 + \log n)/4$ "gates," and a second of depth $\log^2 n$ using
$(n \log^2 n)/2$ gates. These networks avoid the sequential bottlenecks inherent
to earlier solutions, and substantially lower the memory contention.

Finally, to show that counting networks are not merely mathematical 
creatures, we provide experimental evidence that they outperform conven- 
tional synchronization techniques under a variety of circumstances. 



This report supersedes CRL Tech Report 90/11. A preliminary version of 
this work appeared in the Proceedings of the 23rd ACM Symposium on the 
Theory of Computing, New Orleans, May 1991. 

Keywords: Counting Networks, Parallel Processing, Hot-Spots, Network 

1 Introduction

Many fundamental multi-processor coordination problems can be expressed 
as counting problems: processors collectively assign successive values from 
a given range, such as addresses in memory or destinations on an intercon- 
nection network. In this paper, we offer a new approach to solving such 
problems, by introducing counting networks, a new class of networks that 
can be used to count. 

Counting networks, like sorting networks [4, 7, 8], are constructed from 
simple two-input two-output computing elements called balancers, connected 
to one another by wires. However, while an n-input sorting network sorts a
collection of n input values only if they arrive together, on separate wires,
and propagate through the network in lockstep, a counting network can count
any number $N \gg n$ of input tokens even if they arrive at arbitrary times,
are distributed unevenly among the input wires, and propagate through the
network asynchronously.

Figure 2 provides an example of an execution of a 4-input, 4-output, 
counting network. A balancer is represented by two dots and a vertical line 
(see Figure 1). Intuitively, a balancer is just a toggle mechanism alternately 
forwarding inputs to its top and bottom output wires. It thus balances the 
number of tokens on its output wires. In the example of Figure 2, input tokens 
arrive on the network's input wires one after the other. For convenience we 
have numbered them by the order of their arrival (these numbers are not 
used by the network). As can be seen, the first input (numbered 1) enters 
on line 2 and leaves on line 1, the second leaves on line 2, and in general, 
the Nth token will leave on line N mod 4. (The reader is encouraged to try 
this for him/herself.) Thus, if on the ith output line the network assigns to
consecutive outputs the numbers $i, i + 4, i + 2 \cdot 4, \ldots$, it is counting the number
of input tokens without ever passing them all through a shared computing
element!

Counting networks achieve a high level of throughput by decomposing
interactions among processes into pieces that can be performed in parallel.
This decomposition has two performance benefits: it eliminates serial bottlenecks
and reduces memory contention. In practice, the performance of many
shared-memory algorithms is limited by conflicts at certain widely shared
memory locations, often called hot spots [30]. Reducing hot-spot
conflicts has been the focus of hardware architecture design [15, 16, 22, 29]
and experimental work in software [5, 13, 14, 25, 27].

Footnote: One can implement a balancer using a read-modify-write operation such as
Compare & Swap, or a short critical section.

Counting networks are also non-blocking: processes that undergo halt- 
ing failures or delays while using a counting network do not prevent other 
processes from making progress. This property is important because ex- 
isting shared-memory architectures are themselves inherently asynchronous; 
process step times are subject to timing uncertainties due to variations in 
instruction complexity, page faults, cache misses, and operating system ac- 
tivities such as preemption or swapping. 

Section 2 defines counting networks. In Sections 3 and 4, we give two
distinct counting network constructions, each of depth at most
$\log^2 n$, each using at most $(n \log^2 n)/2$ balancers. To illustrate
that counting networks are useful, we use them to construct
high-throughput shared-memory implementations of concurrent data structures
such as shared counters, producer/consumer buffers, and barriers. A
shared counter is simply an object that issues the numbers 0 to m - 1 in response
to m requests by processes. Shared counters are central to a number
of shared-memory synchronization algorithms (e.g., [10, 12, 16, 31]). A
producer/consumer buffer is a data structure in which items inserted by a pool
of producer processes are removed by a pool of consumer processes. A barrier 
is a data structure that ensures that no process advances beyond a partic- 
ular point in a computation until all processes have arrived at that point. 
Compared to conventional techniques such as spin locks or semaphores, our 
counting network implementations provide higher throughput, less memory 
contention, and better tolerance for failures and delays. The implementations 
can be found in Section 5. 

Our analysis of the counting network construction is supported by exper- 
iment. In Section 6, we compare the performance of several implementations 
of shared counters, producer/consumer buffers, and barrier synchronization 
on a shared-memory multiprocessor. When the level of concurrency is suffi- 
ciently high, the counting network implementations outperform conventional 
implementations based on spin locks, sometimes dramatically. Finally, Sec- 
tion 7 describes how to mathematically verify that a given network counts. 

In summary, counting networks represent a new class of concurrent algorithms.
They have a rich mathematical structure, they provide effective
solutions to important problems, and they perform well in practice. We
believe that counting networks have other potential uses, for example as
interconnection networks [32] or as load balancers [28], and that they deserve
further attention.

Figure 1: A Balancer.



2 Networks That Count 
2.1 Counting Networks 

Counting networks belong to a larger class of networks called balancing net- 
works, constructed from wires and computing elements called balancers, in a 
manner similar to the way in which comparison networks [8] are constructed 
from wires and comparators. We begin by describing balancing networks. 

A balancer is a computing element with two input wires and two output
wires (see Figure 1). Tokens arrive on the balancer's input wires at arbitrary
times, and are output on its output wires. Intuitively, one may think of a
balancer as a toggle mechanism that, given a stream of input tokens, repeatedly
sends one token to the top output wire and one to the bottom, effectively
balancing the number of tokens that have been output on its output wires.
We denote by $x_i$, $i \in \{0, 1\}$, the number of input tokens ever received on the
balancer's ith input wire, and similarly by $y_i$, $i \in \{0, 1\}$, the number of tokens
ever output on its ith output wire. Throughout the paper we will abuse this
notation and use $x_i$ ($y_i$) both as the name of the ith input (output) wire and
a count of the number of input tokens received on the wire.

Let the state of a balancer at a given time be defined as the collection of
tokens on its input and output wires. For the sake of clarity we will assume
that tokens are all distinct. We denote by the pair (t, b) the state transition
in which the token t passes from an input wire to an output wire of the
balancer b.

Footnote: In Figure 1, as well as in the sequel, we adopt the notation of [8] and draw wires
as horizontal lines with balancers stretched vertically.

We can now formally state the safety and liveness properties of a balancer: 

1. In any state $x_0 + x_1 \ge y_0 + y_1$ (i.e. a balancer never creates output
tokens).

2. Given any finite number of input tokens $m = x_0 + x_1$ to the balancer,
it is guaranteed that within a finite amount of time, it will reach a
quiescent state, that is, one in which the sets of input and output
tokens are the same. In any quiescent state, $x_0 + x_1 = y_0 + y_1 = m$.

3. In any quiescent state, $y_0 = \lceil m/2 \rceil$ and $y_1 = \lfloor m/2 \rfloor$.
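
The three properties above can be exercised with a small sketch. It assumes the "short critical section" style of balancer mentioned in the introduction, realized here with a Python lock; the class name `Balancer`, the method `traverse`, and the helper `run_tokens` are illustrative names, not part of the paper.

```python
import threading


class Balancer:
    """Toggle that alternates tokens between output 0 (top) and output 1 (bottom)."""

    def __init__(self):
        self._lock = threading.Lock()  # short critical section guarding the toggle
        self._toggle = 0               # output wire the next token will take
        self.y = [0, 0]                # tokens emitted so far on each output wire

    def traverse(self):
        """Pass one token through the balancer; return its output wire (0 or 1)."""
        with self._lock:
            out = self._toggle
            self._toggle = 1 - out
            self.y[out] += 1
            return out


def run_tokens(balancer, threads=4, tokens_per_thread=250):
    """Push threads * tokens_per_thread tokens through concurrently; return m."""
    def worker():
        for _ in range(tokens_per_thread):
            balancer.traverse()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return threads * tokens_per_thread
```

Once all threads have joined, the balancer is quiescent, so `y[0]` and `y[1]` should equal $\lceil m/2 \rceil$ and $\lfloor m/2 \rfloor$ regardless of how the threads interleaved.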

A balancing network of width w is a collection of balancers, where output
wires are connected to input wires, having w designated input wires
$x_0, x_1, \ldots, x_{w-1}$ (which are not connected to output wires of balancers), w
designated output wires $y_0, y_1, \ldots, y_{w-1}$ (similarly unconnected), and containing
no cycles. Let the state of a network at a given time be defined as the union
of the states of all its component balancers. The safety and liveness of the
network follow naturally from the above network definition and the properties
of balancers, namely, that it is always the case that $\sum_{i=0}^{w-1} x_i \ge \sum_{i=0}^{w-1} y_i$,
and for any finite sequence of m input tokens, within finite time the network
reaches a quiescent state, i.e. one in which $\sum_{i=0}^{w-1} y_i = m$.

It is important to note that we make no assumptions about the timing 
of token transitions from balancer to balancer in the network — the net- 
work's behavior is completely asynchronous. Although balancer transitions 
can occur concurrently, it is convenient to model them using an interleaving 
semantics in the style of Lynch and Tuttle [24]. An execution of a network
is a finite sequence $s_0, e_1, s_1, \ldots, e_n, s_n$ or infinite sequence $s_0, e_1, s_1, \ldots$ of
alternating states and balancer transitions such that for each $(s_i, e_{i+1}, s_{i+1})$,
the transition $e_{i+1}$ carries state $s_i$ to $s_{i+1}$. A schedule is the subsequence of
transitions occurring in an execution. A schedule is valid if it is induced by
some execution, and complete if it is induced by an execution which results
in a quiescent state. A schedule s is sequential if for any two transitions
$e_i = (t_i, b_i)$ and $e_j = (t_j, b_j)$, where $t_i$ and $t_j$ are the same token, all
transitions between them also involve that token.



Figure 2: A sequential execution for a BITONIC[4] counting network.

On a shared memory multiprocessor, a balancing network is implemented 
as a shared data structure, where balancers are records, and wires are pointers 
from one record to another. Each of the machine's asynchronous processors 
runs a program that repeatedly traverses the data structure from some input 
pointer (either preassigned or chosen at random) to some output pointer, 
each time shepherding a new token through the network (see Section 5).

We define the depth of a balancing network to be the maximal depth of
any wire, where the depth of a wire is defined as 0 for a network input wire,
and

$\max(\mathrm{depth}(x_0), \mathrm{depth}(x_1)) + 1$

for the output wires of a balancer having input wires $x_0$ and $x_1$. We can thus
formulate the following straightforward yet useful lemma:

Lemma 2.1 If the transition of a token from the input to the output by any
balancer (including the time spent traversing the input wire) takes at most $\Delta$
time, then any input token will exit the network within time at most $\Delta$ times
the network depth.

A counting network of width w is a balancing network whose outputs
$y_0, \ldots, y_{w-1}$ satisfy the following step property:

In any quiescent state, $0 \le y_i - y_j \le 1$ for any i < j.
To illustrate this property, consider an execution in which tokens traverse
the network sequentially, one completely after the other. Figure 2 shows such
an execution on a BITONIC[4] counting network which we will define formally
in Section 3. As can be seen, the network moves input tokens to output wires
in increasing order modulo w. Balancing networks having this property are
called counting networks because they can easily be adapted to count the
total number of tokens that have entered the network. Counting is done by
adding a "local counter" to each output wire i, so that tokens coming out of
that wire are consecutively assigned numbers $i, i + w, \ldots, i + (y_i - 1)w$. (This
application is described in greater detail in Section 5.)
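
The local-counter idea can be sketched as follows. This is only an illustration under the assumption that output wires are numbered from 0; `make_output_counters` and `count_sequentially` are hypothetical helper names, and the sketch models only the output wires, not the network in front of them.

```python
def make_output_counters(w):
    """One local counter per output wire; wire i hands out i, i + w, i + 2w, ..."""
    issued = [0] * w  # y_i: tokens that have already left on wire i

    def next_value(i):
        value = i + issued[i] * w
        issued[i] += 1
        return value

    return next_value


def count_sequentially(w, m):
    """Values handed out when m tokens leave on wires 0, 1, ..., w-1, 0, ... in turn."""
    next_value = make_output_counters(w)
    return [next_value(t % w) for t in range(m)]
```

When tokens leave in increasing order modulo w, as in a sequential execution, the values handed out are exactly 0, 1, ..., m - 1. If tokens leave in a different order but the final output counts have the step property, the same set of values is issued, only in a different order.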

The step property can be defined in a number of ways which we will use 
interchangeably. The connection between them is stated in the following 
lemma: 

Lemma 2.2 If $y_0, \ldots, y_{w-1}$ is a sequence of non-negative integers, the
following statements are all equivalent:

1. For any i < j, $0 \le y_i - y_j \le 1$.

2. Either $y_i = y_j$ for all i, j, or there exists some c such that for any i < c
and j ≥ c, $y_i - y_j = 1$.

3. If $m = \sum_{i=0}^{w-1} y_i$, then $y_i = \lceil (m - i)/w \rceil$.

It is the third form of the step property that makes counting networks usable 
for counting. 
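
As a sanity check on the lemma (not a substitute for the proof), the three forms can be compared by brute force on all short sequences with small entries. The function names below are illustrative only.

```python
from itertools import product


def form1(y):
    """0 <= y_i - y_j <= 1 for every i < j."""
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))


def form2(y):
    """All entries equal, or some c splits y into a high prefix and a low suffix."""
    w = len(y)
    if all(v == y[0] for v in y):
        return True
    return any(all(y[i] - y[j] == 1 for i in range(c) for j in range(c, w))
               for c in range(1, w))


def form3(y):
    """y_i = ceil((m - i) / w), where m is the total number of tokens."""
    w, m = len(y), sum(y)
    # -((i - m) // w) is an integer-only way to compute ceil((m - i) / w).
    return all(y[i] == -((i - m) // w) for i in range(w))


def forms_agree(w, max_tokens):
    """Check the three forms on every length-w sequence with entries in 0..max_tokens."""
    return all(form1(y) == form2(y) == form3(y)
               for y in product(range(max_tokens + 1), repeat=w))
```

For example, with w = 4 and m = 5 the third form gives the sequence 2, 1, 1, 1, which also satisfies the first two forms.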

Proof: We will prove that 3 implies 1, 1 implies 2, and 2 implies 3.

For any indexes a < b, since $0 \le a < b < w$, it must be that
$0 \le \lceil (m - a)/w \rceil - \lceil (m - b)/w \rceil \le 1$. Thus 3 implies 1.

Assume 1 holds for the sequence $y_0, \ldots, y_{w-1}$. If for every $0 \le i < j < w$,
$y_i - y_j = 0$, then 2 follows. Otherwise, there exists the largest a such that
there is a b for which a < b and $y_a - y_b = 1$. From a's being largest we get
that $y_a - y_{a+1} = 1$, and from 1 we get $y_i = y_a$ for any $0 \le i \le a$ and $y_i = y_{a+1}$
for any $a + 1 \le i < w$. Choosing c = a + 1 completes the proof. Thus 1
implies 2.

Assume by way of contradiction that 3 does not hold and 2 does. Without
loss of generality, there thus exists the smallest a such that $m = \sum_{i=0}^{w-1} y_i$ and
$y_a \ne \lceil (m - a)/w \rceil$. If $y_a < \lceil (m - a)/w \rceil$, then since $\sum_{i=0}^{w-1} y_i = m$, by simple arithmetic
there must exist a b > a such that $y_b > \lceil (m - b)/w \rceil$. Since $0 \le \lceil (m - a)/w \rceil - \lceil (m - b)/w \rceil \le 1$,
$y_b - y_a \ge 1$, and no c as in 2 exists, a contradiction. Similarly, if $y_a > \lceil (m - a)/w \rceil$,
there exists a $b \ne a$ such that $y_b < \lceil (m - b)/w \rceil$, and $y_a - y_b \ge 2$. Again no c as in
2 exists, a contradiction. Thus 2 implies 3.



The requirement that a quiescent counting network's outputs have the 
step property might appear to tell us little about the behavior of a counting 
network during an asynchronous execution, but in fact it is surprisingly pow- 
erful. Even in a state in which many tokens are passing through the network, 
the network must eventually settle into a quiescent state if no new tokens 
enter the network. This constraint makes it possible to prove such important 
properties as the following: 

Lemma 2.3 Suppose that in a given execution a counting network with output
sequence $y_0, \ldots, y_{w-1}$ is in a state where m tokens have entered the network
and m' tokens have left it. Then there exist non-negative integers $d_i$,
$0 \le i < w$, such that $\sum_{i=0}^{w-1} d_i = m - m'$ and $y_i + d_i = \lceil (m - i)/w \rceil$.

Proof: Suppose not. Then there is some execution e for which the non-negative
integers $d_i$, $0 \le i < w$, do not exist. If we extend e to a complete execution e'
allowing no additional tokens to enter the network, then at the end of e' the
network will be in a quiescent state where the step property does not hold,
a contradiction. ■

In a sequential execution, where tokens traverse the network one at a time, 
the network is quiescent every time a token leaves. In this case the i-th token
to enter will leave on output i mod w. The lemma shows that in a concurrent,
asynchronous execution of any counting network, any "gap" in this sequence 
of mod w counts corresponds to tokens still traversing the network. This 
critical property holds in any execution, even if quiescent states never occur, 
and even though the definition makes no explicit reference to non-quiescent 
states. 
Figure 3: Recursive Structure of a BITONIC[8] Counting Network.



2.2 Counting vs. Sorting 

A balancing network and a comparison network are isomorphic if one can 
be constructed from the other by replacing balancers by comparators or vice 
versa. The counting networks introduced in this paper are isomorphic to 
the Bitonic sorting network of Batcher [7] and to the Periodic Balanced 
sorting network of Dowd, Perl, Rudolph and Saks [9]. There is a sense in 
which constructing counting networks is "harder" than constructing sorting 
networks: 

Theorem 2.4 If a balancing network counts, then its isomorphic compari- 
son network sorts, but not vice versa. 

Proof: It is easy to verify that balancing networks isomorphic to the
Even-Odd or Insertion sorting networks [8] are not counting networks.

For the other direction, we construct a mapping from the comparison 
network transitions to the isomorphic balancing network transitions. 

By the 0-1 principle [8], a comparison network which sorts all sequences
of 0's and 1's is a sorting network. Take any arbitrary sequence of 0's and 1's
as inputs to the comparison network, and for the balancing network place a
token on each 0 input wire and no token on each 1 input wire. We now show
that if we run both networks in lockstep, the balancing network will simulate
the comparison network, that is, the correspondence between tokens and 0's
holds.

The proof is by induction on the depth of the network. For level 0 the
claim holds by construction. Assuming it holds for wires of a given level k,
let us prove it holds for level k + 1. On every gate where two 0's meet in
the comparison network, two tokens meet in the balancing network, so one
0 leaves on each wire in the comparison network on level k + 1, and one
token leaves on each line in the balancing network on level k + 1. On every
gate where two 1's meet in the comparison network, no tokens meet in the
balancing network, so a 1 leaves on each level k + 1 wire in the comparison
network, and no tokens leave in the balancing network. On every gate where
a 0 and 1 meet in the comparison network, the 0 leaves on the lower wire
and the 1 on the upper wire on level k + 1, while in the balancing network
the token leaves on the lower wire, and no token leaves on the upper wire.

If the balancing network is a counting network, i.e., it has the step property
on its output level wires, then the comparison network must have sorted
the input sequence of 0's and 1's. ■

Corollary 2.5 The depth of any counting network is at least $\Omega(\log n)$.

Though in general a balancing network isomorphic to a sorting network 
is not guaranteed to count, its outputs will always have the step property if 
the input sequence satisfies the following smoothness property: 

A sequence $x_0, \ldots, x_{w-1}$ is smooth if for all i < j, $|x_i - x_j| \le 1$.

This observation is stated formally below: 

Theorem 2.6 If a balancing network is isomorphic to a sorting network, 
and its input sequence is smooth, then its output sequence in any quiescent 
state has the step property. 

Proof: The proof follows along the lines of Theorem 2.4. We will show
the result by constructing a mapping, this time from the transitions of the
balancing network to the transitions of the isomorphic sorting network.
However, unlike in the proof of Theorem 2.4, we will map sets of transitions of
the balancing network to single transitions of the isomorphic sorting network.
We do this by considering the number of tokens that have passed along each
wire of a balancing network in an execution ending in a quiescent state. From
this perspective the transitions of a balancer gate can be mapped to those of
a mathematical device that receives integers $x_0$ and $x_1$ (numbers of tokens)
and outputs integers $\lceil (x_0 + x_1)/2 \rceil$ and $\lfloor (x_0 + x_1)/2 \rfloor$.

Given that the input sequence to the balancing network is smooth, there
is a quantity x such that each input wire carries either x or x + 1 tokens.
By simple induction on the depth of the network, one can prove that every
balancer in such a network receives x or x + 1 tokens on each of its input
wires and emits x or x + 1 tokens on each of its output wires, and that for a
given balancer:

1. If both input wires have x tokens, then both outputs will have x.

2. If one input has x and the other has x + 1, then the output on the top
wire will be x + 1 tokens and on the bottom wire it will be x tokens.

3. If both input wires have x + 1 tokens, then both output wires will have
x + 1 tokens.

This behavior, if one considers x and x + 1 as integers, maps precisely 
to that of comparators of numeric values in a comparison network. Conse- 
quently, in a quiescent state of a balancing network isomorphic to a sorting 
network, if the network as a whole was given a smooth input sequence, its 
output sequence must map to a sorted sequence of integers x and x + 1, 
implying that it has the step property. ■ 
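
The correspondence used in this proof can be checked directly on token counts. The sketch below assumes a comparator that places the larger value on the top wire, matching the top/bottom convention of case 2 above; the function names are illustrative.

```python
def balancer_counts(x0, x1):
    """Quiescent output counts of a balancer fed x0 and x1 tokens."""
    m = x0 + x1
    return (m + 1) // 2, m // 2  # (ceil(m/2), floor(m/2))


def comparator(a, b):
    """Comparator that places the larger value on the top wire."""
    return max(a, b), min(a, b)


def agrees_on_smooth_inputs(x):
    """True if balancer and comparator coincide on every pair drawn from {x, x + 1}."""
    return all(balancer_counts(a, b) == comparator(a, b)
               for a in (x, x + 1) for b in (x, x + 1))
```

On inputs that are not smooth the two devices differ: fed 0 and 2 tokens, the balancer emits one token on each wire, whereas a comparator would leave the values 2 and 0 unchanged in size. This is why Theorem 2.6 needs the smoothness assumption.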

3 A Bitonic Counting Network 

Naturally, counting networks are interesting only if they can be constructed. 
In this section we describe how to construct a counting network whose width 
is any power of 2. The layout of this network is isomorphic to Batcher's 
famous Bitonic sorting network [7, 8], though its behavior and correctness 
arguments are completely different. We give an inductive construction, as 
this will later aid us in proving its correctness. 

Define the width w balancing network MERGER[w] as follows. It has
two sequences of inputs of length w/2, x and x', and a single sequence of
outputs y, of length w. MERGER[w] will be constructed to guarantee that in
a quiescent state where the sequences x and x' have the step property, y will
also have the step property, a fact which will be proved in the next section.

Figure 4: A MERGER[8] balancing network.

We define the network MERGER[w] inductively (see example in Figure 4).
Since w is a power of 2, we will repeatedly use the notation 2k in place of w.
When k is equal to 1, the MERGER[2k] network consists of a single balancer.
For k > 1, we construct the MERGER[2k] network with input sequences x
and x' from two MERGER[k] networks and k balancers. Using a MERGER[k]
network we merge the even subsequence $x_0, x_2, \ldots, x_{k-2}$ of x with the odd
subsequence $x'_1, x'_3, \ldots, x'_{k-1}$ of x' (i.e., the sequence $x_0, \ldots, x_{k-2}, x'_1, \ldots, x'_{k-1}$
is the input to the MERGER[k] network), while with a second MERGER[k]
network we merge the odd subsequence of x with the even subsequence of
x'. Call the outputs of these two MERGER[k] networks z and z'. The final
stage of the network combines z and z' by sending each pair of wires $z_i$ and
$z'_i$ into a balancer whose outputs yield $y_{2i}$ and $y_{2i+1}$.

The MERGER[w] network consists of log w layers of w/2 balancers each.
MERGER[w] guarantees the step property on its outputs only when its inputs
also have the step property, but we can ensure this property by filtering
these inputs through smaller counting networks. We define BITONIC[w] to
be the network constructed by passing the outputs from two BITONIC[w/2]
networks into a MERGER[w] network, where the induction is grounded in
the BITONIC[1] network which contains no balancers and simply passes its
input directly to its output. This construction gives us a network consisting
of $\binom{\log w + 1}{2}$ layers each consisting of w/2 balancers.
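
The recursive construction can be sketched on token counts rather than individual tokens. The sketch relies on the observation that, in a quiescent state, each balancer's output counts are determined by its total input count, so the network's quiescent outputs depend only on how many tokens entered on each input wire. All function names are illustrative, and the random check is a spot test, not a proof.

```python
import random


def balancer(a, b):
    """Quiescent output counts (top, bottom) of a balancer fed a and b tokens."""
    m = a + b
    return (m + 1) // 2, m // 2


def merger(x, xp):
    """MERGER[2k] on input count sequences x and x', each of length k."""
    if len(x) == 1:
        return list(balancer(x[0], xp[0]))
    z = merger(x[0::2], xp[1::2])    # even subsequence of x with odd subsequence of x'
    zp = merger(x[1::2], xp[0::2])   # odd subsequence of x with even subsequence of x'
    y = []
    for a, b in zip(z, zp):          # final layer: z_i, z'_i -> y_2i, y_2i+1
        y.extend(balancer(a, b))
    return y


def bitonic(x):
    """BITONIC[w] applied to input token counts x; w = len(x) is a power of 2."""
    w = len(x)
    if w == 1:
        return list(x)
    return merger(bitonic(x[: w // 2]), bitonic(x[w // 2:]))


def bitonic_depth(w):
    """Layers in BITONIC[w]: depth of BITONIC[w/2] plus the log w layers of MERGER[w]."""
    return 0 if w == 1 else bitonic_depth(w // 2) + (w.bit_length() - 1)


def has_step_property(y):
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))


def check_bitonic(w, trials=1000, seed=0):
    """Feed random token counts to BITONIC[w] and check every quiescent output."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.randint(0, 9) for _ in range(w)]
        y = bitonic(x)
        if sum(y) != sum(x) or not has_step_property(y):
            return False
    return True
```

For instance, three tokens entering on the second input wire of BITONIC[4] leave one per wire on the first three output wires, and `bitonic_depth(8)` evaluates to 6, in agreement with $\log w (1 + \log w)/2$ for w = 8.
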
3.1 Proof of Correctness

In this section we show that BITONIC[w] is a counting network. Before
examining the network itself, we present some simple lemmas about sequences
having the step property.

Lemma 3.1 If a sequence has the step property, then so do all its subse- 
quences. 

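Lemma 3.2 If $x_0, \ldots, x_{k-1}$ has the step property, then its even and odd
subsequences satisfy:

$$\sum_{i=0}^{k/2-1} x_{2i} = \left\lceil \sum_{i=0}^{k-1} x_i / 2 \right\rceil \quad \text{and} \quad \sum_{i=0}^{k/2-1} x_{2i+1} = \left\lfloor \sum_{i=0}^{k-1} x_i / 2 \right\rfloor.$$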

Proof: Either $x_{2i} = x_{2i+1}$ for $0 \le i < k/2$, or by Lemma 2.2 there exists a
unique j such that $x_{2j} = x_{2j+1} + 1$ and $x_{2i} = x_{2i+1}$ for all $i \ne j$, $0 \le i < k/2$.
In the first case, $\sum x_{2i} = \sum x_{2i+1} = \sum x_i / 2$, and in the second case
$\sum x_{2i} = \lceil \sum x_i / 2 \rceil$ and $\sum x_{2i+1} = \lfloor \sum x_i / 2 \rfloor$. ■

Lemma 3.3 Let $x_0, \ldots, x_{k-1}$ and $y_0, \ldots, y_{k-1}$ be arbitrary sequences having
the step property. If $\sum_{i=0}^{k-1} x_i = \sum_{i=0}^{k-1} y_i$, then $x_i = y_i$ for all $0 \le i < k$.

Proof: Let $m = \sum x_i = \sum y_i$. By Lemma 2.2, $x_i = y_i = \lceil (m - i)/k \rceil$. ■

Lemma 3.4 Let $x_0, \ldots, x_{k-1}$ and $y_0, \ldots, y_{k-1}$ be arbitrary sequences having
the step property. If $\sum_{i=0}^{k-1} x_i = \sum_{i=0}^{k-1} y_i + 1$, then there exists a unique j,
$0 \le j < k$, such that $x_j = y_j + 1$, and $x_i = y_i$ for $i \ne j$, $0 \le i < k$.

Proof: Let $m = \sum x_i = \sum y_i + 1$. By Lemma 2.2, $x_i = \lceil (m - i)/k \rceil$ and
$y_i = \lceil (m - 1 - i)/k \rceil$. These two terms agree for all i, $0 \le i < k$, except for the unique i
such that $i \equiv m - 1 \pmod{k}$. ■

We now show that the MERGER[w] network preserves the step property.

Lemma 3.5 If MERGER[2k] is quiescent, and its inputs $x_0, \ldots, x_{k-1}$ and
$x'_0, \ldots, x'_{k-1}$ both have the step property, then its outputs $y_0, \ldots, y_{2k-1}$ have
the step property.
Proof: We argue by induction on log k.

If 2k = 2, MERGER[2k] is just a balancer, so its outputs are guaranteed
to have the step property by the definition of a balancer.

If 2k > 2, let $z_0, \ldots, z_{k-1}$ be the outputs of the first MERGER[k] subnetwork,
which merges the even subsequence of x with the odd subsequence of
x', and let $z'_0, \ldots, z'_{k-1}$ be the outputs of the second. Since x and x' have
the step property by assumption, so do their even and odd subsequences
(Lemma 3.1), and hence so do z and z' (induction hypothesis). Furthermore,
$\sum z_i = \lceil \sum x_i / 2 \rceil + \lfloor \sum x'_i / 2 \rfloor$ and $\sum z'_i = \lfloor \sum x_i / 2 \rfloor + \lceil \sum x'_i / 2 \rceil$ (Lemma 3.2). A
straightforward case analysis shows that $\sum z_i$ and $\sum z'_i$ can differ by at most
1.

We claim that $0 \le y_i - y_j \le 1$ for any i < j. If $\sum z_i = \sum z'_i$, then Lemma
3.3 implies that $z_i = z'_i$ for $0 \le i < k$. After the final layer of balancers,

$y_i - y_j = z_{\lfloor i/2 \rfloor} - z_{\lfloor j/2 \rfloor}$,

and the result follows because z has the step property.

Similarly, if $\sum z_i$ and $\sum z'_i$ differ by one, Lemma 3.4 implies that $z_i = z'_i$
for $0 \le i < k$, except for a unique $\ell$ such that $z_\ell$ and $z'_\ell$ differ by one. Let
$\max(z_\ell, z'_\ell) = x + 1$ and $\min(z_\ell, z'_\ell) = x$ for some non-negative integer x.
From the step property on z and z' we have, for all $i < \ell$, $z_i = z'_i = x + 1$
and for all $i > \ell$, $z_i = z'_i = x$. Since $z_\ell$ and $z'_\ell$ are joined by a balancer with
outputs $y_{2\ell}$ and $y_{2\ell+1}$, it follows that $y_{2\ell} = x + 1$ and $y_{2\ell+1} = x$. Similarly,
$z_i$ and $z'_i$ for $i \ne \ell$ are joined by the same balancer. Thus for any $i < \ell$,
$y_{2i} = y_{2i+1} = x + 1$ and for any $i > \ell$, $y_{2i} = y_{2i+1} = x$. The step property
follows by choosing $c = 2\ell + 1$ and applying Lemma 2.2. ■

The proof of the following theorem is now immediate. 

Theorem 3.6 In any quiescent state, the outputs of BITONIC[w] have the
step property.

4 A Periodic Counting Network 

In this section we show that the bitonic network is not the only counting
network with depth $O(\log^2 n)$. We introduce a new counting network with the
interesting property that it is periodic, consisting of a sequence of identical
subnetworks. Each stage of this periodic network is interesting in its own
right, since it can be used to achieve barrier synchronization with low
contention. This counting network is isomorphic to the elegant balanced periodic
sorting network of Dowd, Perl, Rudolph, and Saks [9]. However, its behavior,
and therefore also our proof of correctness, are fundamentally different.

We start by defining chains and cochains, notions taken from [9]. Given
a sequence $x = \{x_i \mid i = 0, \ldots, n - 1\}$, it is convenient to represent each index
(subscript) as a binary string. A level i chain of x is a subsequence of x whose
indices have the same i low-order bits. For example, the subsequence $x^E$ of
entries with even indices is a level 1 chain, as is the subsequence $x^O$ of entries
with odd indices. The A-cochain of x, denoted $x^A$, is the subsequence whose
indices have the two low-order bits 00 or 11. For example, the A-cochain of
the sequence $x_0, \ldots, x_7$ is $x_0, x_3, x_4, x_7$. The B-cochain $x^B$ is the subsequence
whose low-order bits are 01 and 10.
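
These index-based definitions translate directly into code. The helper names below (`chains`, `a_cochain`, `b_cochain`) are illustrative, and sequences are represented as Python lists indexed from 0.

```python
def chains(x, level):
    """Level-`level` chains of x: entries grouped by the `level` low-order index bits."""
    groups = {}
    for index, value in enumerate(x):
        groups.setdefault(index % (1 << level), []).append(value)
    return groups


def a_cochain(x):
    """Entries whose index ends in binary 00 or 11."""
    return [v for i, v in enumerate(x) if i % 4 in (0b00, 0b11)]


def b_cochain(x):
    """Entries whose index ends in binary 01 or 10."""
    return [v for i, v in enumerate(x) if i % 4 in (0b01, 0b10)]
```

Applied to the indices 0 through 7, `a_cochain` selects positions 0, 3, 4, 7 as in the example above, and `b_cochain` selects the remaining positions 1, 2, 5, 6.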

Define the network BLOCK[k] as follows. When k is equal to 2, the
BLOCK[k] network consists of a single balancer. The BLOCK[2k] network for
larger k is constructed recursively. We start with two BLOCK[k] networks A
and B. Given an input sequence x, the input to A is $x^A$, and the input to
B is $x^B$. Let y be the output sequence for the two subnetworks, where $y^A$
is the output sequence for A and $y^B$ the output sequence for B. The final
stage of the network combines each $y^A_i$ and $y^B_i$ in a single balancer, yielding
final outputs $z_{2i}$ and $z_{2i+1}$. Figure 5 describes the recursive construction of a
BLOCK[8] network. The PERIODIC[2k] network consists of log 2k BLOCK[2k]
networks joined so that the ith output wire of one is the ith input wire of the next.
Figure 6 is a PERIODIC[8] counting network.

This recursive construction is quite different from the one used by Dowd 
et al. We chose this construction because it yields a substantially simpler 
and shorter proof of correctness. 
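
The recursive BLOCK construction can be sketched on token counts, using the fact that the quiescent outputs of a balancer depend only on its total input count. The sketch assumes that a width-w periodic network chains log w copies of BLOCK[w], in line with the statement in Section 4.1 that a sequence of log k BLOCK[k] networks counts; all function names are illustrative, and the random check is a spot test rather than a proof.

```python
import random


def balancer(a, b):
    """Quiescent output counts (top, bottom) of a balancer fed a and b tokens."""
    m = a + b
    return (m + 1) // 2, m // 2


def block(x):
    """BLOCK[w] applied to input token counts x; w = len(x) is a power of 2, w >= 2."""
    if len(x) == 2:
        return list(balancer(x[0], x[1]))
    ya = block([v for i, v in enumerate(x) if i % 4 in (0, 3)])  # A-cochain
    yb = block([v for i, v in enumerate(x) if i % 4 in (1, 2)])  # B-cochain
    z = []
    for a, b in zip(ya, yb):  # final layer: y^A_i, y^B_i -> z_2i, z_2i+1
        z.extend(balancer(a, b))
    return z


def periodic(x):
    """log w copies of BLOCK[w] in series (stage count assumed as stated above)."""
    for _ in range(len(x).bit_length() - 1):
        x = block(x)
    return x


def has_step_property(y):
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))


def check_periodic(w, trials=1000, seed=0):
    """Feed random token counts to the periodic network and check each output."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.randint(0, 9) for _ in range(w)]
        y = periodic(x)
        if sum(y) != sum(x) or not has_step_property(y):
            return False
    return True
```

A single block is not enough in general: with two tokens on the second input wire of a width-4 network, one BLOCK[4] yields the counts 1, 0, 1, 0, and a second BLOCK[4] turns them into 1, 1, 0, 0, which has the step property.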

4.1 Proof of Correctness 

Footnote: Despite the apparent similarities between the layouts of the BLOCK and MERGER
networks, there is no permutation of wires that yields one from the other.

In the proof we use the technical lemmas about input and output sequences
presented in Section 3. The following lemma will serve a key role in the
inductive proof of our construction:

Lemma 4.1 For i > 1,

1. The level i chain of x is a level i - 1 chain of one of x's cochains.

2. The level i chain of a cochain of x is a level i + 1 chain of x.

Proof: Follows immediately from the definitions of chains and cochains. ■ 

As will be seen, the price of modularity is redundancy, that is, balancers in 
lower level blocks will be applied to sub-sequences that already have the de- 
sired step property. We therefore present the following lemma that amounts 
to saying that applying balancers "evenly" to such sequences does not hurt: 

Lemma 4.2 If x and x' are sequences each having the step property, and
pairs $x_i$ and $x'_i$ are routed through a balancer, yielding outputs $y_i$ and $y'_i$, then
the sequences y and y' each have the step property.

Proof: For any i < j, given that x and x' have the step property,
$0 \le x_i - x_j \le 1$ and $0 \le x'_i - x'_j \le 1$, and therefore the difference between any
two wires is $0 \le x_i + x'_i - (x_j + x'_j) \le 2$. By definition, for any i,
$y_i = \lceil (x_i + x'_i)/2 \rceil$ and $y'_i = \lfloor (x_i + x'_i)/2 \rfloor$, and so for any i < j, it is the case
that $0 \le y_i - y_j \le 1$ and $0 \le y'_i - y'_j \le 1$, implying the step property. ■

To prove the correctness of our construction for PERIODIC[k], we will
show that if a block's level i input chains have the step property, then so
do its level i - 1 output chains, for i in {0, ..., log k - 1}. This observation
implies that a sequence of log k BLOCK[k] networks will count an arbitrary
number of inputs.

Lemma 4.3 Let BLOCK[2k] be quiescent with input sequence x and output
sequence y. If $x^E$ and $x^O$ both have the step property, so does y.

Proof: We argue by induction on log k. The proof is similar to that of 
Lemma 3.5. 

For the base case, when 2k = 2, BLOCK[2k] is just a balancer, so its outputs
are guaranteed to have the step property by the definition of a balancer.

Figure 5: A BLOCK[8] balancing network.



For the induction step, assume the result for BLOCK[k] and consider a
BLOCK[2k]. Let x be the input sequence to the block, z the output sequence
of the nested blocks A and B, and y the block's final output sequence. The
inputs to A are the level 2 chains $x^{EE}$ and $x^{OO}$, and the inputs to B
are $x^{EO}$ and $x^{OE}$. By Lemma 4.1, each of these is a level 1 chain of
$x^A$ or $x^B$. These sequences are the inputs to A and B, themselves of size
k, so the induction hypothesis implies that the outputs $z^A$ and $z^B$ of A
and B each has the step property.

Lemma 3.2 implies that $0 \le \sum x^{EE}_i - \sum x^{EO}_i \le 1$ and
$0 \le \sum x^{OE}_i - \sum x^{OO}_i \le 1$. It follows that the sum of A's
inputs, $\sum x^{EE}_i + \sum x^{OO}_i$, and the sum of B's inputs,
$\sum x^{EO}_i + \sum x^{OE}_i$, differ by at most 1. Since balancers do
not swallow or create tokens, $\sum z^A_i$ and $\sum z^B_i$ also differ by at
most 1. If they are equal, then Lemma 3.3 implies that
$z^A_i = z^B_i = z_{2i} = z_{2i+1}$. For i < j,

$$y_i - y_j = z^A_{\lfloor i/2 \rfloor} - z^A_{\lfloor j/2 \rfloor}$$

and the result follows because $z^A$ has the step property.

Similarly, if $\sum z^A_i$ and $\sum z^B_i$ differ by one, Lemma 3.4 implies
that $z^A_i = z^B_i$ for $0 \le i < k$, except for a unique $\ell$ such that
$z^A_\ell$ and $z^B_\ell$ differ by one. Let
$\max(z^A_\ell, z^B_\ell) = x + 1$ and $\min(z^A_\ell, z^B_\ell) = x$ for some
non-negative integer x.

From the step property on $z^A$ and $z^B$ we have, for all $i < \ell$,
$z^A_i = z^B_i = x + 1$ and for all $i > \ell$, $z^A_i = z^B_i = x$. Since
$z^A_\ell$ and $z^B_\ell$ are joined by a balancer with outputs $y_{2\ell}$
and $y_{2\ell+1}$, it follows that $y_{2\ell} = x + 1$ and
$y_{2\ell+1} = x$. Similarly, $z^A_i$ and $z^B_i$ for $i \ne \ell$ are joined
by the same balancer. Thus for any $i < \ell$, $y_{2i} = y_{2i+1} = x + 1$
and for any $i > \ell$, $y_{2i} = y_{2i+1} = x$. The step property
follows by choosing $c = 2\ell + 1$ and applying Lemma 2.2. ■

Theorem 4.4 Let BLOCK[2k] be quiescent with input sequence x and output
sequence y. If all the level i input chains to a block have the step property,
then so do all the level i - 1 output chains.

Proof: We argue by induction on i. Lemma 4.3 provides the base case,
when i is 1.

For the induction step, assume the result for chains up to i - 1. Let x be
the input sequence to the block, z the output sequence of the nested blocks A
and B, and y the block's final output sequence. If i > 1, Lemma 4.1 implies
that every level i chain of x is entirely contained in one cochain or the other.
Each level i chain of x contained in $x^A$ ($x^B$) is a level i - 1 chain of
$x^A$ ($x^B$); each has the step property, and each is an input to A (B). The
induction hypothesis applied to A and B implies that the level i - 2 chains of
$z^A$ and $z^B$ have the step property. But Lemma 4.1 implies that the level
i - 2 chains of $z^A$ and $z^B$ are the level i - 1 chains of z. By Lemma 4.2,
if the level i - 1 chains of z have the step property, so do the level i - 1
chains of y. ■

By Theorem 2.4, the proof of Theorem 4.4 constitutes a simple alterna- 
tive proof that the balanced periodic comparison network of [9] is a sorting 
network. 

5 Implementation and Applications 

In a MIMD shared-memory architecture, a balancer can be represented as 
a record with two fields: toggle is a boolean value that alternates between 
0 and 1, and next is a 2-element array of pointers to successor balancers. 
A balancer is a leaf if it has no successors. A process shepherds a token 
through the network by executing the procedure shown in Figure 7. In our 
implementations, we preassigned processes to input lines so that they were 
evenly distributed. Thus, a given process always started shepherding tokens 
from the same preassigned line. At each balancer, the process toggles the
balancer's state and visits the next balancer, halting when it reaches a leaf.

Figure 6: A PERIODIC[8] counting network.

balancer = [toggle: boolean, next: array [0..1] of ptr to balancer]
traverse(b: balancer)
    loop until leaf(b)
        i := rmw(b.toggle := ¬ b.toggle)
        b := b.next[i]
    end loop
end traverse

Figure 7: Code for Traversing a Balancing Network

The network's wiring
information can be cached by each process, and so the transition time of
a balancer will be almost entirely a function of the efficiency of the toggle 
implementation. Advancing the toggle state can be accomplished either by
a short critical section guarded by a spin lock*, or by a read-modify-write
operation (rmw for short) if the hardware supports it. Note that all values
are bounded.
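As a rough illustration of the traversal procedure of Figure 7, the following Python sketch represents a balancer as a record with a toggle and two successors. Several details are assumptions of the sketch rather than statements from the text: output wires are modelled by hypothetical `Leaf` nodes, the read-modify-write is taken to return the toggle's old value, a `threading.Lock` stands in for the spin lock, and `build_width4` hand-wires six balancers following the width-4 bitonic layout of Section 3.

```python
import threading

class Balancer:
    """A record with a toggle bit and a 2-element array of successors."""
    def __init__(self):
        self.toggle = 0
        self.next = [None, None]          # successor for toggle value 0 and 1
        self.lock = threading.Lock()      # stands in for the spin lock or rmw

class Leaf:
    """Hypothetical terminal node marking an output wire."""
    def __init__(self, wire):
        self.wire = wire

def traverse(b):
    """Shepherd one token from balancer b to a leaf; return the exit wire."""
    while not isinstance(b, Leaf):
        with b.lock:                      # short critical section
            i = b.toggle                  # old toggle value picks the output
            b.toggle = 1 - b.toggle
        b = b.next[i]
    return b.wire

def build_width4():
    """Hand-wired width-4 network; returns the two entry balancers."""
    leaves = [Leaf(i) for i in range(4)]
    b1, b2, ma, mb, f0, f1 = (Balancer() for _ in range(6))
    b1.next = [ma, mb]                    # input wires 0 and 1 enter at b1
    b2.next = [mb, ma]                    # input wires 2 and 3 enter at b2
    ma.next = [f0, f1]
    mb.next = [f0, f1]
    f0.next = [leaves[0], leaves[1]]
    f1.next = [leaves[2], leaves[3]]
    return b1, b2
```

Feeding tokens one at a time into either entry balancer makes them leave on wires 0, 1, 2, 3, 0, 1, ... in turn, which is the step property seen one token at a time.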

We illustrate the utility of counting networks by constructing highly con- 
current implementations of three common data structures: shared counters, 
producer/consumer buffers, and barriers. In Section 6 we give some experi- 
mental evidence that counting network implementations have higher through- 
put than conventional implementations when contention is sufficiently high. 

5.1 Shared Counter 

A shared counter [12, 10, 16, 31] is a data structure that issues consecutive 
integers in response to increment requests. More formally, in any quiescent 
state in which m increment requests have been received, the values 0 to 
m - 1 have been issued in response. To construct the counter, start with
an arbitrary width-w counting network. Associate an integer cell $c_i$ with the
i-th output wire. Initially, $c_i$ holds the value i. A process requests a number
by traversing the counting network. When it exits the network on wire i, it
atomically adds w to the value of $c_i$ and returns $c_i$'s previous value.
Lemmas 2.1 and 2.3 imply that: 

*A spin lock is just a shared boolean flag that is raised and lowered by at
most one processor at a time, while the other processors wait.

Lemma 5.1 Let x be the largest number yet returned by any increment request
on the counter. Let R be the set of numbers less than x which have not
been issued to any increment request. Then

1. The size of R is no greater than the number of operations still in
progress.

2. If $y \in R$, then $y \ge x - w|R|$.

3. Each number in R will be returned by some operation in time
$\Delta \cdot d + \Delta_c$, where d is the depth of the network, $\Delta$ is
the maximum balancer delay, and $\Delta_c$ is the maximum time to update a
cell on an output wire.
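The counter construction can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation measured in Section 6: the class names (`Balancer`, `Merger`, `Bitonic`, `SharedCounter`) are hypothetical, the network is the recursive bitonic construction of Section 3 (width a power of two), and a `threading.Lock` replaces both the balancer's spin lock and the atomic update of each output cell.

```python
import threading

class Balancer:
    def __init__(self):
        self._toggle = 0
        self._lock = threading.Lock()

    def traverse(self):
        """Return 0 (top output) or 1 (bottom output), alternating per token."""
        with self._lock:
            old = self._toggle
            self._toggle = 1 - old
            return old

class Merger:
    """MERGER[width]: two half-width mergers followed by a layer of balancers."""
    def __init__(self, width):
        self.width = width
        self.layer = [Balancer() for _ in range(width // 2)]
        self.half = [Merger(width // 2), Merger(width // 2)] if width > 2 else None

    def traverse(self, wire):
        if self.width == 2:
            return self.layer[0].traverse()
        if wire < self.width // 2:      # token from x: even wires go to half[0]
            out = self.half[wire % 2].traverse(wire // 2)
        else:                           # token from x': odd wires go to half[0]
            out = self.half[1 - wire % 2].traverse(wire // 2)
        return 2 * out + self.layer[out].traverse()

class Bitonic:
    """BITONIC[width]: two half-width bitonic networks feeding a merger."""
    def __init__(self, width):
        self.width = width
        self.merger = Merger(width)
        self.half = [Bitonic(width // 2), Bitonic(width // 2)] if width > 2 else None

    def traverse(self, wire):
        if self.width == 2:
            return self.merger.traverse(wire)
        k = self.width // 2
        sub = wire // k
        out = self.half[sub].traverse(wire % k)
        return self.merger.traverse(sub * k + out)

class SharedCounter:
    def __init__(self, width):
        self.width = width
        self.network = Bitonic(width)
        self.cells = list(range(width))                   # cell i starts at i
        self.locks = [threading.Lock() for _ in range(width)]

    def get_and_increment(self, input_wire):
        i = self.network.traverse(input_wire % self.width)
        with self.locks[i]:                               # atomic add of w
            value = self.cells[i]
            self.cells[i] += self.width
            return value
```

When requests are issued one at a time, the returned values are consecutive integers regardless of the input wires chosen; under concurrency the guarantee is the weaker one stated in Lemma 5.1.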

5.2 Producer/Consumer Buffer

A producer/consumer buffer is a data structure in which items inserted by a 
pool of m producer processes are removed by a pool of m consumer processes. 
The buffer algorithm used here is essentially that of Gottlieb, Lubachevsky, 
and Rudolph [16]. The buffer is a w-element array buff[0..w - 1]. There
are two width-w counting networks, a producer network, and a consumer
network. A producer starts by traversing the producer network, leaving the
network on wire i. It then atomically inspects buff[i], and, if it is ⊥, replaces
it with the produced item. If that position is full, then the producer waits
for the item to be consumed (or returns an exception). Similarly, a consumer
traverses the consumer network, exits on wire j, and if buff[j] holds an item,
atomically replaces it with ⊥. If there is no item to consume, the consumer
waits for an item to be produced (or returns an exception).
Lemmas 2.1 and 2.3 imply that: 

Lemma 5.2 Suppose m producers and m' consumers have entered a
producer/consumer buffer built out of counting networks of depth d. Assume
that the time to update each buff[i] once a process has left the counting
network is negligible. Then if m < m', every producer leaves the network in time
$d\Delta$. Similarly, if m > m', every consumer leaves the network in time $d\Delta$.
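The buffer protocol described above can be sketched as follows. To keep the example self-contained, the two counting networks are replaced by a hypothetical `RoundRobinNetwork` stand-in that simply hands out exit wires in cyclic order under a lock; any object with a `traverse()` method returning an exit wire could be substituted. The use of `threading.Condition` for waiting (instead of spinning) and the `EMPTY` sentinel in place of ⊥ are also assumptions of this sketch.

```python
import threading

EMPTY = object()   # plays the role of the "empty slot" marker

class RoundRobinNetwork:
    """Stand-in for a width-w counting network: exit wires 0, 1, ..., w-1, 0, ..."""
    def __init__(self, width):
        self.width = width
        self._next = 0
        self._lock = threading.Lock()

    def traverse(self):
        with self._lock:
            wire = self._next
            self._next = (wire + 1) % self.width
            return wire

class Buffer:
    def __init__(self, width):
        self.slots = [EMPTY] * width
        self.conds = [threading.Condition() for _ in range(width)]
        self.producers = RoundRobinNetwork(width)   # producer network
        self.consumers = RoundRobinNetwork(width)   # consumer network

    def put(self, item):
        i = self.producers.traverse()               # leave the network on wire i
        with self.conds[i]:
            while self.slots[i] is not EMPTY:       # slot full: wait for a consumer
                self.conds[i].wait()
            self.slots[i] = item
            self.conds[i].notify_all()

    def get(self):
        j = self.consumers.traverse()               # leave the network on wire j
        with self.conds[j]:
            while self.slots[j] is EMPTY:           # slot empty: wait for a producer
                self.conds[j].wait()
            item = self.slots[j]
            self.slots[j] = EMPTY
            self.conds[j].notify_all()
            return item
```

Because producers and consumers visit the slots in the same cyclic order, each slot sees alternating insertions and removals, and contention is spread over w slots rather than a single lock.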

5.3 Barrier Synchronization 

A barrier is a data structure that ensures that no process advances beyond 
a particular point in a computation until all processes have arrived at that 
point. Barriers are often used in highly concurrent numerical computations
to divide the work into disjoint phases with the property that no process
executes phase i while another process concurrently executes phase i + 1.

A simple way to construct an n-process barrier is by exploiting the fol- 
lowing key observation: Lemma 2.3 implies that as soon as some process 
exits with value n, the last phase must be complete, since the other n — 1 
processes must already have entered the network. 

We present a stronger result: one does not need a full counting network
to achieve barrier synchronization. A threshold network of width w is a
balancing network with input sequence $x_i$ and output sequence $y_i$, such that
the following holds:

In any quiescent state, $y_{w-1} = m$ if and only if
$mw \le \sum x_i < (m + 1)w$.

Informally, a threshold network can "detect" each time w tokens have passed
through it. A counting network is a threshold network, but not vice versa.

Both the BLOCK[w] network used in the periodic construction and the
MERGER[w] network used in the bitonic construction are threshold networks,
provided the input sequence satisfies the smoothness property. Recall that a
sequence $x_0, \ldots, x_{w-1}$ is smooth if for all i < j, $|x_i - x_j| \le 1$.
Every sequence with the step property is smooth, but not vice versa. The
following two lemmas state that smoothness is "stable" under partitioning into
subsequences or application of additional balancers.

Lemma 5.3 Any subsequence of a smooth sequence is smooth. 

Lemma 5.4 If the input sequence to a balancing network is smooth, so is
the output sequence. 

Proof: Observe that if the inputs to a balancer differ by at most one, then 
so do its outputs. By a simple induction on the depth of the network, the 
output sequence from the balancers at any level of a balancing network with 
a smooth input sequence, is a permutation of its input sequence, hence, the 
network's output sequence is smooth. ■ 

Theorem 5.5 If the input sequence to BLOCK[w] is smooth, then BLOCK[w]
is a threshold network.
Proof: Let $x_i$ be the block's input sequence, $z_i$ the output sequence of
nested blocks A and B, and $y_i$ the block's output sequence.

We first show that if $y_{w-1} = m$, then $mw \le \sum x_i < (m + 1)w$. We argue
by induction on w, the block's width. If w = 2, the result is immediate.
Assume the result for w = k and consider BLOCK[2k] in a quiescent state
where $y_{2k-1} = m$. Since x is smooth by hypothesis, by Lemma 5.4 so are
z and y. Since $y_{2k-1}$ and $y_{2k-2}$ are outputs of a common balancer,
$y_{2k-2}$ is either m or m + 1. The rest is a case analysis.

If $y_{2k-1} = y_{2k-2} = m$, then $z_{2k-1} = z_{2k-2} = m$. By the induction
hypothesis and Lemma 5.3 applied to A and B, $mk \le \sum x^A_i < (m + 1)k$ and
$mk \le \sum x^B_i < (m + 1)k$, and therefore
$2mk \le \sum x^A_i + \sum x^B_i < 2(m + 1)k$.

If $y_{2k-2} = m + 1$, then one of $z^A_{k-1}$ and $z^B_{k-1}$ is m, and the
other is m + 1. Without loss of generality suppose $z^A_{k-1} = m + 1$ and
$z^B_{k-1} = m$. By the induction hypothesis,
$(m + 1)k \le \sum x^A_i < (m + 2)k$ and $mk \le \sum x^B_i < (m + 1)k$. Since x
is smooth, by Lemma 5.3 $x^B$ is smooth and some element of $x^B$ must be equal
to m, which in turn implies that no element of $x^A$ exceeds m + 1. This bound
implies that $(m + 1)k = \sum x^A_i$. It follows that
$2mk + k \le \sum x^A_i + \sum x^B_i < 2(m + 1)k$, yielding the desired result.

We now show that if $mw \le \sum x_i < (m + 1)w$, then $y_{w-1} = m$. We again
argue by induction on w, the block's width. If w = 2, the result is immediate.
Assume the result for w = k and consider BLOCK[2k] in a quiescent state
where $2mk \le \sum x_i < 2(m + 1)k$. Since x is smooth, by Lemma 5.4
$m \le y_{2k-1}$. Furthermore, since x is smooth, by Lemma 5.3, either
$mk \le \sum x^A_i < (m + 1)k$ and $mk \le \sum x^B_i \le (m + 1)k$ or vice
versa, which by the induction hypothesis implies that
$z^A_{k-1} + z^B_{k-1} \le 2m + 1$. It follows that $y_{2k-1} < m + 1$, which
completes our claim. ■

The proof that the MERGER[w] network is also a threshold network if its
inputs are smooth is omitted because it is almost identical to that of Theorem
5.5. A threshold counter is constructed by associating a local counter $c_i$ with
each output wire i, just as in the counter construction.

We construct a barrier for n processes, where n = 0 mod w, using a
width-w threshold counter. The construction is an adaptation of the
"sense-reversing" barrier construction of [18] as follows. Just as for the counter
construction, we associate a local counter $c_i$ with each output wire i. Let F
be a boolean flag, initially false. Let a process's phase at a given point in the
execution of the barrier algorithm be defined as 0 initially, and incremented
by 1 every time the process begins traversing the network. With each phase
the algorithm will associate a sense, a boolean value reflecting the phase's
parity: true for the first phase, false for the second, and so on. As illustrated
in Figure 8, the token for process P, after a phase with sense s, enters the
network on wire P mod w. If it emerges with a value not equal to n - 1 mod n,
then it waits until F agrees with s before starting the next phase. If it emerges
with value n - 1 mod n, it sets F to s, and starts the next phase.

As an aside, we note that a threshold counter implemented from a BLOCK[k]
network can be optimized in several additional ways. For example, it is only 
necessary to associate a local counter with wire w — 1, and that counter can 
be modulo n rather than unbounded. Moreover, all balancers that are not 
on a path from some input wire to exit wire w — 1 can be deleted. 

Theorem 5.6 If P exits the network with value n after completing phase $\phi$,
then every other process has completed phase $\phi$, and no process has started
phase $\phi + 1$.

Proof: We first observe that the input to BLOCK[w] is smooth, and therefore
it is a threshold network. We argue by induction. When P receives value
v = n - 1 at the end of the first phase, exactly n tokens must have entered
BLOCK[w], and all processes must therefore have completed the first phase.
Since the boolean F is still false, no process has started the second phase.
Assume the result for phase $\phi$. If Q is the process that received value n - 1
at the end of that phase, then exactly $\phi n$ tokens had entered the network
when Q performed the reset of F. If P receives value v = n - 1 at the end of
phase $\phi + 1$, then exactly $(\phi + 1)n$ tokens have entered the network,
implying that an additional n tokens have entered, and all n processes have
finished the phase. No process will start the next phase until F is reset. ■
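The sense-reversing protocol can be sketched in Python as follows. For brevity this sketch replaces the threshold network by a single lock-protected counter, which corresponds to the simple sense-reversing barrier used as the control in Section 6.4; the class name `SenseBarrier`, the thread-local storage of the sense, and the yielding spin loop are assumptions of the sketch.

```python
import threading
import time

class SenseBarrier:
    def __init__(self, n):
        self.n = n
        self.flag = False                 # the shared flag F, initially false
        self._count = 0                   # stand-in for the threshold counter
        self._lock = threading.Lock()
        self._local = threading.local()   # holds each thread's current sense s

    def _next_value(self):
        """Issue consecutive values 0, 1, 2, ... (fetch-and-increment)."""
        with self._lock:
            v = self._count
            self._count += 1
            return v

    def wait(self):
        s = getattr(self._local, "sense", True)   # true for the first phase
        v = self._next_value()
        if v % self.n == self.n - 1:
            self.flag = s                 # last arrival of the phase sets F to s
        else:
            while self.flag != s:         # wait until F agrees with s
                time.sleep(0)             # yield to other threads while spinning
        self._local.sense = not s         # reverse the sense for the next phase
```

Substituting a threshold counter for `_next_value` would distribute the contention over the network's balancers while leaving the flag handling unchanged.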



barrier()
    v := exit wire of traverse(wire P mod w)
    if v = n - 1 (mod w)
        then F := s
        else wait until F = s
    end if
    s := ¬s
end barrier

Figure 8: Barrier Synchronization Code

6 Performance
6.1 Overview

In this section, we analyze counting network throughput for computations
in which tokens are eventually spread evenly through the network. As
mentioned before, to ensure that tokens are evenly spread across the input wires,
each processor could be assigned a fixed input wire. Alternatively, processors
could choose input wires at random.

The network saturation S at a given time is defined to be the ratio of
the number of tokens n present in the network (i.e. the number of processors
shepherding tokens through it) to the number of balancers. If tokens
are spread evenly through the network, then the saturation is just the
expected number of tokens at each balancer. For the BITONIC and PERIODIC
networks, S = 2n/wd. The network is oversaturated if S > 1, and
undersaturated if S < 1.

An oversaturated network represents a full pipeline, hence its throughput
is dominated by the per-balancer contention, not by the network depth. If
a balancer with S tokens makes a transition in time $\Delta(S)$, then
approximately w/2 tokens emerge from the network every $\Delta(S)$ time units,
yielding a throughput of $w/2\Delta(S)$. $\Delta$ is an increasing function whose
exact form depends on the particular architecture, but similar measures of
degradation have been observed in practice to grow linearly [5, 25]. The
throughput of an oversaturated network is therefore maximized by choosing w
and d to minimize S, bringing it as close as possible to 1.

The throughput of an undersaturated network is dominated by the network
depth, not by the per-balancer contention, since the network pipeline
is partially empty. Every 1/S time units, w/2 tokens leave the network,
yielding throughput wS/2. The throughput of an undersaturated network is
therefore maximized by choosing w and d to increase S, bringing it as close
as possible to 1.
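The saturation formula is easy to evaluate for the networks considered in this paper. The following Python sketch uses the depths stated in the abstract (log w (1 + log w)/2 for the bitonic network and log^2 w for the periodic network); the function names and the label "saturated" for the boundary case S = 1 are hypothetical conveniences.

```python
from math import log2

def bitonic_depth(w):
    """Depth of the bitonic network of width w (w a power of two)."""
    lg = int(log2(w))
    return lg * (lg + 1) // 2

def periodic_depth(w):
    """Depth of the periodic network of width w (w a power of two)."""
    lg = int(log2(w))
    return lg * lg

def saturation(n, w, d):
    """S = 2n / (w d): n tokens spread over the network's w*d/2 balancers."""
    return 2 * n / (w * d)

def regime(n, w, d):
    s = saturation(n, w, d)
    if s > 1:
        return "oversaturated"
    if s < 1:
        return "undersaturated"
    return "saturated"
```

For example, a width-4 bitonic network has depth 3 and therefore 6 balancers, so it reaches S = 1 with 6 concurrent tokens, while a width-4 periodic network has depth 4 and reaches S = 1 with 8.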

This analysis is necessarily approximate, but it is supported by
experimental evidence. In the remainder of this section, we present the results
of timing experiments for several data structures implemented using counting
networks. As a control, we compare these figures to those produced by
more conventional implementations using spin locks. These implementations
were done on an Encore Multimax, using Mul-T [21], a parallel dialect of
Lisp. The spin lock is a simple "test-and-test-and-set" loop [26] written in
assembly language, and provided by the Mul-T run-time system. In our
implementations, each balancer is protected by a spin lock.

Figure 9: Bitonic Shared Counter Implementations

6.2 The Shared Counter 

We compare seven shared counter implementations: bitonic and periodic 
counting networks of widths 16, 8, and 4, and a conventional spin lock im- 
plementation (which can be considered a degenerate counting network of 
width 2). For each network, we measured the elapsed time necessary for
$2^{20}$ (approximately a million) tokens to traverse the network, controlling the
level of concurrency. 

For the bitonic network, the width-16 network has 80 balancers, the 
width-8 network has 24 balancers, and the width-4 network has 6 balancers. 
In Figure 9, the horizontal axis represents the number of processes executing 
concurrently. When concurrency is 1, each process runs to completion be- 
fore the next one starts. The number of concurrent processes increases until 
all sixteen processes execute concurrently. The vertical axis represents the 
elapsed time (in seconds) until all $2^{20}$ tokens had traversed the network. With
no concurrency, the networks are heavily undersaturated, and the spin lock's 
throughput is the highest by far. As saturation increases, however, so does 
the throughput for each of the networks. The width-4 network is undersatu- 
rated at concurrency levels less than 6. As the level of concurrency increases 
from 1 to 6, saturation approaches 1, and the elapsed time decreases. Beyond 
6, saturation increases beyond 1, and the elapsed time eventually starts to 
grow. The other networks remain undersaturated for the range of the exper- 
iment; their elapsed times continue to decrease. Each of the networks begins 
to outperform the spin lock at concurrency levels between 8 and 12. When 
concurrency is maximal, all three networks have throughputs at least twice 
the spin lock's. Notice that as the level of concurrency increases, the spin 
lock's performance degrades in an approximately linear fashion (because of 
increasing contention). 
Figure 11: Producer/Consumer Buffer Implementations



The performance of the periodic network (Figure 10) is similar. The 
width-4 network reaches saturation 1 at 8 processes; its throughput then 
declines slightly as it becomes oversaturated. The other networks remain 
undersaturated, and their throughputs continue to increase. Each of the 
counting networks outperforms the spin lock at sufficiently high levels of 
contention. At 16 processes, the width-4 and width-8 networks have almost 
twice the throughput of the single spin-lock implementation. Each bitonic 
network has a slightly higher throughput than its periodic counterpart. 

6.3 Producer/Consumer Buffers

We compare the performance of several producer/consumer buffers imple- 
mented using the algorithm of Gottlieb, Lubachevsky, and Rudolph [16] dis- 
cussed in Section 5. Each implementation has 8 producer processes, which 
continually produce items, and 8 consumer processes, which continually con- 
sume items. If a producer (consumer) process finds its buffer slot full (empty), 
it spins until the slot becomes empty (full). 

We consider buffers with bitonic and periodic networks of width 2, 4, 
and 8. As a final control, we tested a circular buffer protected by a single 
spin lock, a structure that permits no concurrency between producers and 
consumers. Figure 11 shows the time in seconds needed to produce and 
consume $2^{20}$ tokens. Not surprisingly, the single spin-lock implementation is
much slower than any of the others. The width-2 network is heavily over- 
saturated, the bitonic width-4 network is slightly oversaturated, while the 
others are undersaturated. 
Figure 12: Barrier Implementations



6.4 Barrier Synchronization 

Figure 12 shows the time (in seconds) taken by 16 processes to perform 2^^ 
barrier synchronizations. The remaining columns show BLOCK[k] networks
of width 4, 8, and 16. The last column shows a simple sense-reversing barrier 
in which the BLOCK network is replaced by a single counter protected by a 
spin lock. The three network barriers are equally fast, and each takes about 
two-thirds the time of the spin-lock implementation. 

7 Verifying That a Network Counts 

The "0-1 law" states that a comparison network is a sorting network if (and 
only if) it sorts input sequences consisting entirely of zeroes and ones, a 
property that greatly simplifies the task of reasoning about sorting networks. 
In this section, we present an analogous result: a balancing network having m 
balancers is a counting network if (and only if) it satisfies the step property 
for all sequential executions in which up to $2^m$ tokens have traversed the
network. This result simplifies reasoning about counting networks, since it 
is not necessary to consider all concurrent executions. However, as we show, 
the number of tokens passed through the network in the longest of these 
sequential executions cannot be less than exponential in the network depth. 
We begin by proving that it suffices to consider only sequential executions. 

Lemma 7.1 Let s be a valid schedule of a given balancing network. Then 
there exists a valid sequential schedule s' such that the number of tokens which 
pass through each balancer in s and s' is equal. 

Proof: Let $s = s_0 \cdot p \cdot q \cdot s_1$, where $s_0$, $s_1$ are sequences
of transitions, p and q are individual transitions involving distinct tokens P
and Q, and where "$\cdot$" is the concatenation operator. If p and q do not
occur at the same balancer, then $s_0 \cdot q \cdot p \cdot s_1$ is a valid
schedule. If p and q do occur at the same balancer, then
$s_0 \cdot q \cdot p \cdot s'_1$ is a valid schedule where $s'_1$ is constructed
from $s_1$ by swapping the identities of P and Q. In each case we can swap p
and q without changing the preceding sequence of transitions $s_0$ and without
changing the number of tokens that pass through any balancer during the
execution.

Now suppose that s is a complete schedule. We will transform it into a
sequential schedule by a process similar to selection sorting. Choose some
total ordering of the tokens in s. Split s into $s_0 \cdot t_0$ where $s_0$ is
the empty sequence and $t_0 = s$. Now repeatedly carry out the following
procedure which constructs $s_{i+1} \cdot t_{i+1}$ from $s_i \cdot t_i$: while
$t_i$ is nonempty let p be the earliest transition in $t_i$ whose token is
ordered as less than or equal to all tokens in $t_i$. Move p to the beginning
of $t_i$ by swapping it with each earlier token in $t_i$ as described above, and
let $s_{i+1} = s_i \cdot p$ and $t_{i+1}$ be the suffix of the resulting schedule
after p. This procedure is easily seen to maintain the following invariant:

1. After stage i, $s_i \cdot t_i$ is a valid schedule in which each balancer
passes the same number of tokens as in s.

2. After stage i, $s_i$ is sorted by token.

Thus when the procedure terminates, we have a valid sequential schedule
s' in which each balancer passes the same number of tokens as in s. ■

Theorem 7.2 A balancing network with m balancers satisfies the step
property in all executions if (and only if) it satisfies it in all sequential
executions in which at most $2^m$ tokens traverse the network.

Proof: Since the step property depends only on the number of tokens that 
pass through the network's output wires, it follows from Lemma 7.1 that a 
balancing network satisfies the step property in all executions if (and only 
if) it satisfies it in all sequential executions. 

We now show that any failure to satisfy the step property can be detected
in some execution involving at most $2^m$ tokens. Consider sequential
executions of a balancing network with m balancers. Any quiescent state is
characterized by specifying for each balancer the output wire to which it will
send the next token, yielding a maximum of $2^m$ distinct quiescent states. In
a sequential execution, each time a token traverses the network, it carries
the network from one quiescent state to another. Thus, in any execution,
after at most $2^m$ traversals, the network must reenter its initial state. Let
H be the shortest sequential execution needed to detect a violation of the
step property. If H involves more than $2^m$ tokens, then H can be split into
a prefix $H_0$ and a suffix $H_1$ such that $H_0$ involves at most $2^m$ tokens
and leaves the network in its initial state. If $H_0$ sends "illegal" numbers of
tokens through two output wires, then $H_0$ alone suffices to detect the
violation, and otherwise $H_1$ alone suffices. ■
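The finite-state argument above suggests a mechanical check for small networks. The Python sketch below is one possible way to carry it out and is not the procedure of the theorem itself: rather than enumerating executions of up to $2^m$ tokens, it searches the reachable pairs (toggle state, number of tokens so far mod w). It relies on an observation that is an assumption of this sketch, derived from Lemma 2.2: in a sequential execution the step property holds after every token exactly when the token numbered k (counting from 0) leaves on wire k mod w. The layered network encoding, in which each balancer is a pair (top wire, bottom wire), and all names are hypothetical.

```python
from collections import deque

def route(layers, toggles, wire):
    """Send one token in on `wire`; return (new toggle state, exit wire)."""
    state = [list(layer) for layer in toggles]
    for li, layer in enumerate(layers):
        for bi, (top, bottom) in enumerate(layer):
            if wire in (top, bottom):
                old = state[li][bi]
                state[li][bi] = 1 - old
                wire = top if old == 0 else bottom
                break                      # at most one balancer per layer
    return tuple(tuple(layer) for layer in state), wire

def counts_sequentially(layers, width):
    """True if every sequential execution keeps the step property."""
    start = tuple(tuple(0 for _ in layer) for layer in layers)
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        toggles, k = queue.popleft()
        for wire in range(width):          # try every possible input wire
            nxt, exit_wire = route(layers, toggles, wire)
            if exit_wire != k:             # token k must leave on wire k mod w
                return False
            state = (nxt, (k + 1) % width)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return True

# Width-4 bitonic layout: three layers of two balancers each.
BITONIC_4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]
```

The search visits at most $2^m \cdot w$ states, so it is practical only for small m; by Lemma 7.1, checking sequential executions is sufficient.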

How tight is this bound? We now construct a balancing network that is 
not a counting network, yet satisfies the step property for any execution in 
which the number of tokens is less than exponential in the network depth. 
Through the remainder of this section we will only consider networks in 
quiescent states, so that we can ignore issues of timing and concentrate solely 
on the total number of tokens that have passed along each wire. 

First, consider the following balancing network STAGE[2w]. Take two
counting networks A and B of width w having output wires $a_0$ through
$a_{w-1}$ and $b_0$ through $b_{w-1}$ respectively. Add a layer of w balancers
such that the i-th balancer has inputs $a_i$ and $b_{w-1-i}$ and outputs $a'_i$
and $b'_{w-1-i}$. The resulting network STAGE[2w] is not a counting network;
however, it is easily extended to one by virtue of the following lemma.

Lemma 7.3 For any input to STAGE[2w], there exists a permutation $\pi_a$ of
the output sequence $a'_0, \ldots, a'_{w-1}$ and a permutation $\pi_b$ of the
output sequence $b'_0, \ldots, b'_{w-1}$ such that the sequence
$\pi_a(a'_0, \ldots, a'_{w-1}) \cdot \pi_b(b'_0, \ldots, b'_{w-1})$ has the
step property.

Proof: Observe that the total inputs to any two balancers in the last layer
differ by at most 1.

Thus there is always a k such that every balancer in the last layer outputs
either k or k + 1 tokens. If k is even, then $b'_i = k/2$ for all i and
$a'_i = a_i + b_{w-i-1} - k/2$, which is either k/2 or k/2 + 1. One can obtain a
sequence with the step property by setting $\pi_a$ to sort the values in a'. If
k is odd, then each $a'_i$ is (k + 1)/2 and each $b'_i$ is
$a_{w-i-1} + b_i - (k + 1)/2$, which will be either (k + 1)/2 or
(k + 1)/2 - 1. In this case having $\pi_b$ sort the values in b' produces the
desired result. ■

By Lemma 2.2 it follows that

Corollary 7.4 For any m tokens input to STAGE[2w],
$\sum_{i=0}^{w-1} a'_i = \sum_{i=0}^{w-1} \lceil (m - i)/2w \rceil$ and
$\sum_{i=0}^{w-1} b'_i = \sum_{i=w}^{2w-1} \lceil (m - i)/2w \rceil$.

In other words, the total number of tokens that end up on the
$a'_0, \ldots, a'_{w-1}$ and $b'_0, \ldots, b'_{w-1}$ output wires is the same as
in a proper counting network. In fact, Lemma 7.3 guarantees an even stronger
property: the actual number of tokens on each wire corresponds to the number
of tokens that occur on some wire in the output sequence of a proper counting
network. However, there is no guarantee that these numbers appear in the
correct order (or even the same order given different inputs). Because of
Theorem 2.6, we can extend the STAGE[2w] network into a (not very efficient)
counting network by passing the outputs $a'_0, \ldots, a'_{w-1}$ and
$b'_0, \ldots, b'_{w-1}$ to two separate balancing networks isomorphic to
sorting networks. But we are not interested in getting a working counting
network; instead we will use a modified version of STAGE[2w] to construct a
balancing network which counts all input sequences with up to some bounded
number of tokens, but fails on sequences with more tokens.

We construct such a balancing network (denoted ALMOST[2w]) as follows.
Take a STAGE[2w] network and modify it by picking some x other than 0 or
w - 1 and deleting the final balancer between $a_x$ and $b_{w-1-x}$. Denote this
balancing network as STAGE^x[2w]. Let ALMOST[2w] be the periodic network
constructed from k stages, for some k > 0, each a STAGE^x[2w] network, with
the outputs of each stage connected to the inputs of the next.

Let $A_t$ and $B_t$ be the sums of the number of tokens input to each of the two
subnetworks A and B in the t-th stage of ALMOST[2w]. $A_0$ and $B_0$ are thus
the numbers of tokens input to A and B respectively. Let
$y = \{y_0, \ldots, y_{2w-1}\}$ be the sequence given by
$y_i = \lceil (A_0 + B_0 - i)/2w \rceil$. Thus, $y_i$ counts the number
of tokens that would exit on output wire i if ALMOST[2w] were a counting
network.

We now define the quantities $A_\infty$ and $B_\infty$ used in the proofs below.
They measure the number of tokens that would have come out of the respective
parts of network in the last stage ($t = \infty$) if it were a counting network.
Formally, let $A_\infty = \sum_{i=0}^{w-1} y_i$ and
$B_\infty = \sum_{i=w}^{2w-1} y_i$. Note that
$A_t + B_t = A_0 + B_0 = A_\infty + B_\infty$ for all t and that by Lemma 2.2,
$\lceil (A_\infty - i)/w \rceil = y_i$ and
$\lceil (B_\infty - i)/w \rceil = y_{w+i}$ for all i.

Finally, let the imbalance $\delta_t = A_t - A_\infty = -(B_t - B_\infty)$; this
quantity represents "how far" the network is from balancing the tokens between
the A and B subnetworks in stage t, in other words, how many excess tokens
must be moved from the A part of the network to the B part (or, if the
quantity is negative, how many tokens should be moved from B to A).
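The behaviour of the imbalance can be explored numerically in quiescent states. The following Python sketch rests on assumptions that should be read as such: the subnetworks A and B are treated as ideal counting networks, so that by Lemma 2.2 their outputs depend only on their input totals; the primed A-side outputs of each stage are taken to feed the A subnetwork of the next stage, and likewise for B; and the function names are hypothetical.

```python
def counting_outputs(total, w):
    """Quiescent outputs of a width-w counting network fed `total` tokens."""
    return [(total - i + w - 1) // w for i in range(w)]   # ceil((total - i) / w)

def stage_x(A_t, B_t, w, x):
    """One stage with the balancer between a_x and b_(w-1-x) deleted.

    Returns the token totals leaving on the A side and on the B side.
    """
    a = counting_outputs(A_t, w)
    b = counting_outputs(B_t, w)
    for i in range(w):
        if i == x:
            continue                      # the deleted balancer
        j = w - 1 - i
        total = a[i] + b[j]
        a[i], b[j] = (total + 1) // 2, total // 2   # a' gets the ceiling
    return sum(a), sum(b)

def imbalances(A0, B0, w, x, stages):
    """Return the imbalances for t = 0, 1, ..., stages."""
    y = counting_outputs(A0 + B0, 2 * w)
    A_inf = sum(y[:w])                    # tokens a counting network sends to A
    deltas, A, B = [], A0, B0
    for _ in range(stages + 1):
        deltas.append(A - A_inf)
        A, B = stage_x(A, B, w, x)
    return deltas
```

With w = 4, x = 1, and all 10 tokens entering on the A side, `imbalances(10, 0, 4, 1, 2)` returns `[4, 1, 0]`: the imbalance shrinks stage by stage, so a network with too few stages can still fail to count for sufficiently lopsided inputs.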

The following lemma follows from arguments almost identical to those of 
Lemma 5.4. 

Lemma 7.5 If the input sequence to a balancing network has the step prop- 
erty, then so does the output sequence. 

Lemma 7.6 In the output sequence of stage t of ALMOST[2w], each $a_i$ is
equal to $y_i + e_i$, where $e_i \le 0$ when $\delta_t \le 0$, and $e_i \ge 0$
when $\delta_t \ge 0$; and each $b_i$ is equal to $y_{w+i} + e_{w+i}$, where
$e_{w+i} \le 0$ when $\delta_t \ge 0$, and $e_{w+i} \ge 0$ when $\delta_t \le 0$.

Proof: For i < w we have

$$e_i = a_i - y_i = \lceil (A_t - i)/w \rceil - \lceil (A_\infty - i)/w \rceil$$

which is at least zero when $\delta_t \ge 0$ and at most zero when
$\delta_t \le 0$.

The claim for $e_{w+i} = b_i - y_{w+i}$ follows by a similar argument. ■

Corollary 7.7 If $\delta_t = 0$ then the output sequences of stage t of
ALMOST[2w] have the step property.

Proof: If $\delta_t = 0$ then by the preceding lemma each $a_i = y_i$ and
$b_i = y_{w+i}$, so the output sequences of stage t form the sequence y. Since y
has the step property it is left unchanged by the final layer of balancers
(Lemma 7.5). ■

Lemma 7.8 $\delta_{t+1} = \lfloor (\lceil (A_t - x)/w \rceil - \lceil (B_t - (w - 1 - x))/w \rceil)/2 \rfloor$.

Proof: If a balancer were placed between a'^ and h'^_-^_^ after stage t, then 
the Stage^[2io] network would become a STAGE [2w] counting network, and 
by Corollary 7.4, exactly A^o tokens would emerge from the A half of the 
network after stage t + 1, giving an imbalance would be 0. The above quantity 
^f+i is simply the number of tokens that this balancer would move from the 
A part of the network to the B part in order to bring the parts into balance, 
and is thus the actual imbalance that results from deleting the balancer. ■ 
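
The argument of Lemma 7.8 can be checked numerically with a minimal Python sketch. It rests on assumptions that are not spelled out in this passage: the A and B subnetworks emit step sequences, the final layer joins output $a_i$ to output $b_{w-1-i}$, and a balancer sends $\lceil (a+b)/2 \rceil$ of its tokens to its top (A-side) output. The function names and the choice $w = 4$, $x = 1$ are hypothetical.

```python
def ceil_div(n, d):
    """Ceiling of n / d for integers (d > 0)."""
    return -((-n) // d)


def step_sequence(tokens, width):
    """Wire i of a step sequence carries ceil((tokens - i) / width) tokens."""
    return [ceil_div(tokens - i, width) for i in range(width)]


def imbalance_by_formula(a_t, b_t, w, x):
    """Tokens the missing balancer would have moved from A to B."""
    top = ceil_div(a_t - x, w)               # a_x
    bottom = ceil_div(b_t - (w - 1 - x), w)  # b_{w-1-x}
    return (top - bottom) // 2               # Python's // is a floor


def imbalance_by_simulation(a_t, b_t, w, x):
    """Apply the final layer with the balancer on wire x deleted, then
    compare the A-side total with A_inf."""
    a = step_sequence(a_t, w)
    b = step_sequence(b_t, w)
    a_next = 0
    for i in range(w):
        if i == x:
            a_next += a[i]  # no balancer here: a_x passes through unchanged
        else:
            a_next += ceil_div(a[i] + b[w - 1 - i], 2)  # balancer's top output
    a_inf = sum(step_sequence(a_t + b_t, 2 * w)[:w])
    return a_next - a_inf


w, x = 4, 1  # hypothetical width parameter and deleted-balancer index
for a_t in range(60):
    for b_t in range(60):
        assert imbalance_by_formula(a_t, b_t, w, x) == \
            imbalance_by_simulation(a_t, b_t, w, x)
print(imbalance_by_formula(43, 37, w, x))  # 1
```

Under these assumptions the closed-form expression and the direct simulation agree on every tested pair $(A_t, B_t)$.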

The following lemmas show that the imbalance tends toward zero as more
stages are added:

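Lemma 7.9 If $\delta_t \ge 0$ then $\delta_{t+1} \ge 0$. If $\delta_t \le 0$ then $\delta_{t+1} \le 0$.

Proof: Suppose $\delta_t \ge 0$. Then $A_t \ge A_\infty$ and $B_t \le B_\infty$, and so

$$\delta_{t+1} = \left\lfloor \frac{\lceil (A_t - x)/w \rceil - \lceil (B_t - (w-1-x))/w \rceil}{2} \right\rfloor
\ge \left\lfloor \frac{\lceil (A_\infty - x)/w \rceil - \lceil (B_\infty - (w-1-x))/w \rceil}{2} \right\rfloor
= 0.$$

(The last equality holds because when the two parts of the network hold $A_\infty$
and $B_\infty$ tokens there is no imbalance.)

Reversing the inequalities gives the corresponding result for $\delta_t \le 0$. ■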

Lemma 7.10 If $|\delta_t| > 0$ then $|\delta_{t+1}| \le |\delta_t| - 1$.

Proof: By virtue of Lemma 7.9 we need only show that $\delta$ decreases when
positive and increases when negative.

Let $a_0, \ldots, a_{w-1}, b_0, \ldots, b_{w-1}$ be the outputs of the A and B subnetworks
of the $(t + 1)$-th stage before the last layer of balancers. Because $\delta_t \ne 0$,
this sequence does not have the step property; however, each of the two
subsequences $a_0, \ldots, a_{w-1}$ and $b_0, \ldots, b_{w-1}$ is the output of a counting network
and so has the step property. Thus the step property of the whole sequence
must be violated by some $a_i$, $b_j$ such that $a_i - b_j$ is either less than 0 or
greater than 1.

We will consider two cases, depending on the sign of $\delta_t$.

Case 1. $\delta_t < 0$. Then by Lemma 7.6 each $a_i \le y_i$ and each $b_j \ge y_{w+j}$. (Recall
that $y_i$ is the number of tokens that would exit from the $i$-th output
of a counting network with the same input sequence.) So for each
$a_i$ and each $b_j$ we have, using the step property of the $y$ sequence,
$a_i \le y_i \le y_{w+j} + 1 \le b_j + 1$. Thus:

1. For each $a_i$ and $b_{w-i-1}$, $a_i \le b_{w-i-1} + 1$, so the balancer between
these outputs moves no tokens from the A side to the B side.

2. Given some $a_i$ and $b_j$ that violate the step property, it cannot
be the case that $a_i > b_j + 1$ and thus it must be the case that
$a_i < b_j$. But then $a_{w-1} \le a_i < b_j \le b_0$, and since $a_{w-1}$ and $b_0$ are
connected by a balancer, that balancer moves at least one token
from the B side to the A side.

Hence at least one token moves from the B side to the A side and
$\delta_{t+1} > \delta_t$.

Case 2. $\delta_t > 0$. Then each $a_i \ge y_i$ and each $b_j \le y_{w+j}$. So $a_i \ge y_i \ge y_{w+j} \ge b_j$.
Thus:

1. For each $a_i$ and $b_{w-i-1}$, $a_i \ge b_{w-i-1}$, so no final-stage balancer
moves tokens from the B side to the A side.

2. Given some $a_i$ and $b_j$ that violate the step property, it must be
the case that $a_i \ge b_j + 2$. But $a_0 \ge a_i \ge b_j + 2 \ge b_{w-1} + 2$; so the
balancer between $a_0$ and $b_{w-1}$ moves at least one token from the
A side to the B side.

Hence at least one token moves from the A side to the B side and
$\delta_{t+1} < \delta_t$. ■
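
The shrinking imbalance of Lemmas 7.9 and 7.10 can be observed by iterating the expression of Lemma 7.8. The sketch below is illustrative only: it assumes the deleted balancer sits at an interior index $0 < x < w - 1$ (the proof above relies on the balancers at $a_0$ and $a_{w-1}$ being present), and the names and sample values ($w = 4$, $x = 1$, $A_\infty = B_\infty = 40$) are hypothetical.

```python
def ceil_div(n, d):
    """Ceiling of n / d for integers (d > 0)."""
    return -((-n) // d)


def next_imbalance(delta, a_inf, b_inf, w, x):
    """One stage of the recurrence: delta_t -> delta_{t+1} (Lemma 7.8)."""
    a_t, b_t = a_inf + delta, b_inf - delta
    return (ceil_div(a_t - x, w) - ceil_div(b_t - (w - 1 - x), w)) // 2


def trajectory(delta0, a_inf, b_inf, w, x):
    """Successive imbalances until zero is reached.

    The loop is bounded by |delta0| steps, which suffices if each stage
    removes at least one unit of imbalance as Lemma 7.10 states."""
    assert 0 < x < w - 1, "sketch assumes an interior deleted balancer"
    deltas = [delta0]
    for _ in range(abs(delta0)):
        if deltas[-1] == 0:
            break
        deltas.append(next_imbalance(deltas[-1], a_inf, b_inf, w, x))
    return deltas


# Hypothetical example: 80 tokens on width 8, so A_inf = B_inf = 40.
print(trajectory(30, 40, 40, 4, 1))   # [30, 8, 2, 1, 0]
print(trajectory(-30, 40, 40, 4, 1))  # [-30, -7, -2, 0]
```

In both runs the sign of the imbalance is preserved and its magnitude drops by at least one per stage, roughly by a factor of $w$ while it is large.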



Lemma 7.11 $\delta_{t+1} = \delta_t/w + c$ where $-3/2 < c < 3/2$.

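Proof: From Lemma 7.8 we have:

$$\delta_{t+1} = \left\lfloor \frac{\lceil (A_t - x)/w \rceil - \lceil (B_t - (w-1-x))/w \rceil}{2} \right\rfloor.$$

Looking more closely at the $B_t$ term, notice that
$\lceil (B_t - (w-1-x))/w \rceil = \lceil (B_t + x + 1)/w \rceil - 1$.
If $(B_t + x + 1)/w$ is not an integer then this is just $\lfloor (B_t + x + 1)/w \rfloor$, which is equal to
$\lfloor (B_t + x)/w \rfloor$ since subtracting 1 from the numerator cannot bring it below the next
integral multiple of $w$. Now if $(B_t + x + 1)/w$ is an integer then this is $\lfloor (B_t + x + 1)/w \rfloor - 1$,
which in this case is equal to $\lfloor (B_t + x)/w \rfloor$ since subtracting 1 from the numerator
does bring it below an integral multiple of $w$. So in either case we have
$\lceil (B_t - (w-1-x))/w \rceil = \lfloor (B_t + x)/w \rfloor$ and we can rewrite the original expression as:

$$\begin{aligned}
\delta_{t+1} &= \left\lfloor \frac{\lceil (A_t - x)/w \rceil - \lfloor (B_t + x)/w \rfloor}{2} \right\rfloor \\
&= \frac{(A_t - x)/w - (B_t + x)/w + c_1}{2} - c_2 \\
&= \frac{A_t - B_t}{2w} - \frac{x}{w} + \frac{c_1}{2} - c_2 \\
&= \frac{2\delta_t + (A_\infty - B_\infty)}{2w} - \frac{x}{w} + \frac{c_1}{2} - c_2
\end{aligned}$$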

where $0 \le c_1 < 2$ and $0 \le c_2 < 1$. Using the fact that $0 \le A_\infty - B_\infty \le w$
(hence $0 \le (A_\infty - B_\infty)/2w \le 1/2$), and that $0 \le x \le w - 1$ (hence
$-1/2 < -x/w \le 0$), we can rewrite all of the terms not containing $\delta_t$ as a single value
$c$ and get

$$\delta_{t+1} = \frac{\delta_t}{w} + c$$

where the bound $-3/2 < c < 3/2$ is obtained by summing the bounds on the
individual terms. ■
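
The bound of Lemma 7.11 can be checked exactly for small cases with rational arithmetic. This is a sketch under stated assumptions, not part of the original proof: the helper names are hypothetical, and the parameters $w = 4$, $x = 1$ are an arbitrary choice for which $-x/w = -1/4$ lies within the range used above.

```python
from fractions import Fraction


def ceil_div(n, d):
    """Ceiling of n / d for integers (d > 0)."""
    return -((-n) // d)


def next_imbalance(delta, a_inf, b_inf, w, x):
    """delta_{t+1} as given by Lemma 7.8."""
    a_t, b_t = a_inf + delta, b_inf - delta
    return (ceil_div(a_t - x, w) - ceil_div(b_t - (w - 1 - x), w)) // 2


def residual_c(delta, a_inf, b_inf, w, x):
    """The term c = delta_{t+1} - delta_t / w, computed without rounding."""
    return next_imbalance(delta, a_inf, b_inf, w, x) - Fraction(delta, w)


w, x = 4, 1  # hypothetical parameters
seen = []
for m in range(64):  # total number of tokens
    y = [ceil_div(m - i, 2 * w) for i in range(2 * w)]
    a_inf, b_inf = sum(y[:w]), sum(y[w:])
    # Keep A_t = a_inf + delta and B_t = b_inf - delta non-negative.
    for delta in range(-a_inf, b_inf + 1):
        c = residual_c(delta, a_inf, b_inf, w, x)
        assert Fraction(-3, 2) < c < Fraction(3, 2)
        seen.append(c)
print(min(seen), max(seen))
print(residual_c(30, 40, 40, w, x))  # 1/2
```

For example, with $\delta_t = 30$ and $A_\infty = B_\infty = 40$ the next imbalance is 8, so $c = 8 - 30/4 = 1/2$.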



Theorem 7.12 Let $w$ be a power of 2 greater than 1. Then there exists a
width-$2w$ balancing network that has the step property in all executions with
up to $w^{(k-4)}$ tokens, yet is not a counting network.

Proof: From Lemma 7.11 we have $|\delta_{t+1}| < |\delta_t|/w + 3/2$. Let $U(t)$ be defined
by the recurrence $U(0) = |\delta_0|$, $U(t + 1) = U(t)/w + 3/2$; then $U(t)$ is a strict
upper bound on $|\delta_t|$ for $t > 0$. Solving the recurrence using standard methods
yields $U(t) = |\delta_0| w^{-t} + \frac{3/2}{1 - 1/w} - \frac{3}{2}\left(\frac{w}{w-1}\right) w^{-t}$.

Now suppose the network is given an input involving at most $w^t$ tokens.
Then $|\delta_0|$ cannot possibly exceed $w^t$, and after $t$ stages
$|\delta_t| < U(t) \le 1 + \frac{3/2}{1 - 1/w} - \frac{3}{2}\left(\frac{w}{w-1}\right) w^{-t}$,
which is at most 4 if $w \ge 2$ and $t \ge 1$. So by Lemma
7.10, $|\delta_{t+4}| = 0$ and thus by Corollary 7.7 the outputs of stage $t + 4$ have the
step property. Thus a network with $k = t + 4$ stages will count up to $w^{(k-4)}$
tokens.

To see that this $k$-stage network is not a counting network, suppose $|\delta_0| >
4w^{(k+1)}$. From Lemma 7.11 we have $|\delta_{t+1}| > |\delta_t|/w - 3/2$. Let $L(t)$ be defined
by $L(0) = |\delta_0|$ and $L(t + 1) = L(t)/w - 2$; $L(t)$ is a strict lower bound on $|\delta_t|$
for $t > 0$. Solving the recurrence gives $L(t) = |\delta_0| w^{-t} - \frac{2}{1 - 1/w} + \left(\frac{2w}{w-1}\right) w^{-t}$.
Dropping the last term and setting $|\delta_0| > 4w^{(k+1)}$ gives $|\delta_{k+1}| > L(k + 1) >
4 - \frac{2}{1 - 1/w} \ge 0$. Since $\delta_{k+1} \ne 0$, the outputs of stage $k$ (and hence the entire
network) cannot have the step property. ■
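
Both recurrences in the proof have the form $u(t+1) = u(t)/w + c$, whose solution is $u(t) = u(0)\,w^{-t} + c\,(1 - w^{-t})/(1 - 1/w)$. The following Python sketch, with hypothetical function names, compares this closed form against direct iteration using exact rationals, and checks the two numeric claims used above for a few small values of $w$.

```python
from fractions import Fraction


def iterate(u0, w, c, t):
    """Apply u -> u / w + c exactly t times."""
    u = Fraction(u0)
    for _ in range(t):
        u = u / w + c
    return u


def closed_form(u0, w, c, t):
    """u0 * w^-t + c * (1 - w^-t) / (1 - 1/w)."""
    wt = Fraction(1, w ** t)
    return u0 * wt + c * (1 - wt) / (1 - Fraction(1, w))


for w in (2, 4, 8):
    for t in range(1, 8):
        # Upper bound U(t) with |delta_0| = w^t, the largest value possible
        # when at most w^t tokens enter the network.
        upper = closed_form(w ** t, w, Fraction(3, 2), t)
        assert upper == iterate(w ** t, w, Fraction(3, 2), t)
        assert upper < 4
        # Lower bound L(k + 1) with |delta_0| just above 4 * w^(k+1), k = t.
        big = 4 * w ** (t + 1) + 1
        lower = closed_form(big, w, Fraction(-2), t + 1)
        assert lower == iterate(big, w, Fraction(-2), t + 1)
        assert lower > 0

print(closed_form(16, 2, Fraction(3, 2), 4))  # 61/16
```

For $w = 2$ and $|\delta_0| = 16$, four stages bring the upper bound down to $61/16 < 4$, matching the claim in the proof.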

8 Discussion 

Counting networks deserve further study. We believe that they represent 
a start toward a general theory of low-contention data structures. Work is 
needed to develop other primitives, to derive upper and lower bounds and 
new performance measures. We have made a start in this direction by deriving
constructions and lower bounds for linearizable counting networks [20],
networks which guarantee that the values assigned to tokens reflect the
real-time order of their traversals. Aharonson and Attiya [3], Felten, LaMarca,
and Ladner [11], and Hardavellas, Karakos, and Mavronicolas [17] have
investigated the structure of counting networks with fan-in greater than two.
Klugerman and Plaxton [23] have shown an explicit network construction of
depth $O(c^{\log^* n} \log n)$ for some small constant $c$, and an existential proof of a
network of depth $O(\log n)$.

Work is also needed in experimental directions, comparing counting
networks to other techniques, for example those based on exponential backoff
[1], and for understanding their behavior in architectures other than the 
single-bus architecture provided by the Encore. We have made a start in 
this direction by comparing the performance of counting networks to that of 
known methods using the ASIM simulator of the MIT Alewife machine [19]. 
Preliminary results show that there is a substantial gain in performance due 
to parallelism on such distributed memory machines. 

Finally, we point out that smoothing networks, balancing networks that 
smooth but do not necessarily count, are interesting in their own right since 
they can be used as hardware solutions to problems such as load balancing 
(cf. [28]). 